Abstract

Protein subcellular localization prediction, as an essential step to elucidate the in vivo functions of proteins and identify 
drug targets, has been extensively studied in previous decades. Instead of only determining subcellular localization of 
single-label proteins, recent studies have focused on predicting both single- and multi-location proteins. Computational 
methods based on Gene Ontology (GO) have been demonstrated to be superior to methods based on other features. 
However, existing GO-based methods focus on the occurrences of GO terms and disregard their relationships. This paper 
proposes a multi-label subcellular-localization predictor, namely HybridGO-Loc, that leverages not only the GO term 
occurrences but also the inter-term relationships. This is achieved by hybridizing the GO frequencies of occurrences and the 
semantic similarity between GO terms. Given a protein, a set of GO terms is retrieved by searching against the gene 
ontology database, using the accession numbers of homologous proteins obtained via BLAST search as the keys. The 
frequency of GO occurrences and semantic similarity (SS) between GO terms are used to formulate frequency vectors and 
semantic similarity vectors, respectively, which are subsequently hybridized to construct fusion vectors. An adaptive- 
decision based multi-label support vector machine (SVM) classifier is proposed to classify the fusion vectors. Experimental 
results based on recent benchmark datasets and a new dataset containing novel proteins show that the proposed hybrid- 
feature predictor significantly outperforms predictors based on individual GO features as well as other state-of-the-art 
predictors. For readers' convenience, the HybridGO-Loc server, which is for predicting virus or plant proteins, is available 
online at http://bioinfo.eie.polyu.edu.hk/HybridGoServer/.

Introduction

Proteins must be located in the appropriate physiological contexts
within a cell to exert their biological functions. Subcellular
localization of proteins is essential to the functions of proteins
and has been suggested as a means to maximize functional diversity
and economize on protein design and synthesis [1]. Aberrant protein
subcellular localization is closely correlated with a broad range of
human diseases, such as Alzheimer's disease [2], kidney stones [3],
primary human liver tumors [4], breast cancer [5], pre-eclampsia [6]
and Bartter syndrome [7]. Knowing where a protein resides within a
cell can give insights into drug target identification and drug
design [8,9]. Wet-lab experiments such as fluorescent microscopy
imaging, cell fractionation and electron microscopy are the gold
standard for validating subcellular localization and are essential
for the design of high-quality localization databases such as The
Human Protein Atlas (http://www.proteinatlas.org/). However, wet-lab
experiments are time-consuming and laborious. With the avalanche of
newly discovered protein sequences in the post-genomic era,
computational methods are required to assist biologists in dealing
with large-scale proteomic data to determine the subcellular
localization of proteins.

Conventionally, subcellular-localization predictors can be 
roughly divided into sequence-based and annotation-based. 
Sequence-based methods use (1) amino-acid compositions 
[10,11], (2) sequence homology [12,13], and (3) sorting signals 
[14,15] as features. Annotation-based methods use information 
beyond the protein sequences, such as Gene Ontology (GO) terms 
[16-21], Swiss-Prot keywords [22], and PubMed abstracts [23,24]. 
A number of studies have demonstrated that methods based on 
GO information are superior to methods based on sequence-based 
features [25-28]. Note that the GO database contains not only 
experimental data but also predicted data 
(http://www.geneontology.org/GO.evidence.shtml), which may be determined 
by sequence-based methods. From this point of view, the GO- 
based prediction, which uses the GO annotation database to 
retrieve GO terms, is a filtering method for sequence-based 
predictions. 

The GO comprises three orthogonal taxonomies whose terms
describe the cellular components, biological processes, and
molecular functions of gene products. The GO terms in each
taxonomy are organized within a directed acyclic graph. These
terms are placed within structural relationships, the most
important of which are the 'is-a' relationship (parent and child)
and the 'part-of' relationship (part and whole) [29,30]. Recently,
the GO Consortium has enriched the ontology with more structural
relationships, such as 'positively-regulates', 'negatively-regulates'
and 'has-part' [31,32]. These relationships reflect that the GO
hierarchical tree for each taxonomy contains redundant information,
for which semantic similarity over GO terms can be found.

Instead of only determining subcellular localization of single- 
label proteins, recent studies have been focusing on predicting 
both single- and multi-location proteins. Since there exist multi- 
location proteins that can simultaneously reside at, or move 
between, two or more subcellular locations, it is important to 
include these proteins in the predictors. Actually, multi-location
proteins play important roles in some metabolic processes that take
place in more than one cellular compartment, e.g., fatty acid
β-oxidation in the peroxisome and mitochondria, and antioxidant
defense in the cytosol, mitochondria and peroxisome [33].

Recently, several multi-label predictors based on GO have been 
proposed, including Plant-mPLoc [34], Virus-mPLoc [35], iLoc- 
Plant [36], iLoc-Virus [37], KNN-SVM [38], mGOASVM [39] 
and others [40,41]. These predictors have demonstrated superiority
over sequence-based methods. However, they use only the
occurrences of the GO terms and do not take the semantic
relationships between GO terms into account.

Since the relationship between GO terms reflects the association 
between different gene products, protein sequences annotated with 
GO terms can be compared on the basis of semantic similarity 
measures. The semantic similarity over GO has been extensively
studied and has been applied to many biological problems,
including protein function prediction [42,43], subnuclear
localization prediction [44], protein-protein interaction inference
[45-47] and microarray clustering [48]. The performance of these
predictors depends on whether the similarity measure is relevant to 
the biological problems. Over the years, a number of semantic 
similarity measures have been proposed, some of which have been 
used in natural language processing. 

Semantic similarity measures can be applied at the GO-term 
level or the gene-product level. At the GO-term level, methods are 
roughly categorized as node-based and edge-based. The node- 
based measures basically rely on the concept of information 
content of terms, which was proposed by Resnik [49] for natural 
language processing. Later, Lord et al. [50] applied this idea to 
measure the semantic similarity among GO terms. Lin et al. [51] 
proposed a method based on information theory and structural 
information. Subsequently, more node-based measures [52-54] 
were proposed. Edge-based measures are based on using the
length or the depth of different paths between terms and/or their
common ancestors [55-58]. At the gene-product level, the two most
common methods are pairwise approaches [59-63] and groupwise
approaches [64-67]. Pairwise approaches measure similarity
between two gene products by combining the semantic similarities
between their terms. Groupwise approaches, on the other hand,
directly group the GO terms of a gene product as a set, a graph or
a vector, and then calculate the similarity by set similarity
techniques, graph matching techniques or vector similarity
techniques. More recently, Pesquita et al. [68] reviewed the
semantic similarity measures applied to biomedical ontologies, and
Guzzi et al. [69] provided a comprehensive review of the
relationship between semantic similarity measures and biological
features.

This paper proposes a multi-label predictor based on hybridizing
frequency of occurrences of GO terms and semantic
similarity between the terms for protein subcellular localization
prediction. Compared to existing multi-label subcellular-localiza- 
tion predictors, our proposed predictor has the following 
advantages: (1) it formulates the feature vectors by hybridizing 
GO frequency of occurrences and GO semantic similarity features 
which contain richer information than only GO term frequencies; 
(2) it adopts a new strategy to incorporate richer and more useful 
homologous information from more distant homologs rather than 
using the top homologs only; (3) it adopts an adaptive decision 
strategy for multi-label SVM classifiers so that it can effectively 
deal with datasets containing both single-label and multi-label 
proteins. Results on two recent benchmark datasets and a new 
dataset containing novel proteins demonstrate that these three 
properties enable the proposed predictor to accurately predict 
multi-location proteins and outperform several state-of-the-art 
predictors. 

Methods 

Legitimacy of Using GO Information 

Despite their good performance, GO-based methods have
received some criticism from the research community. The main
argument is that the cellular component GO terms already contain
the cellular component categories, i.e., if the GO terms are known,
the subcellular locations will also be known. The prediction problem
can therefore be easily solved by creating a lookup table using the
cellular component GO terms as the keys and the cellular component
categories as the hashed values. Such a
naive solution, however, will lead to very poor prediction 
performance, as demonstrated and explained in our previous 
studies [28,39]. A number of studies [70-72] by other groups also 
strongly support the legitimacy of using GO information for 
subcellular localization. For example, as suggested by [72], the 
good performance of GO-based methods is due to the high 
representation power of the GO space as compared to the 
Euclidean feature spaces used by the conventional sequence-based 
methods. 

Retrieval of GO Terms 

The proposed predictor can use either the accession numbers 
(AC) or amino acid (AA) sequences of query proteins as input. 
Specifically, for proteins with known ACs, their respective GO 
terms are retrieved from the Gene Ontology annotation (GOA) 
database (http://www.ebi.ac.uk/GOA) using the ACs as the 
searching keys. For proteins without ACs, their AA sequences are 
presented to BLAST [73] to find their homologs, whose ACs are 
then used as keys to search against the GOA database. 

While the GOA database allows us to associate the AC of a 
protein with a set of GO terms, for some novel proteins, neither 
their ACs nor the ACs of their top homologs have any entries in 
the GOA database; in other words, no GO terms can be retrieved 
by using their ACs or the ACs of their top homologs. In such a case, 
the ACs of the homologous proteins, as returned from BLAST 
search, will be successively used to search against the GOA 
database until a match is found. With the rapid progress of the 
GOA database, it is reasonable to assume that the homologs of the 
query proteins have at least one GO term [17]. Thus, it is not 
necessary to use back-up methods to handle the situation where no 
GO terms can be found. The procedures are outlined in Fig 1. 
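
The retrieval procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the GOA database is replaced by an in-memory dictionary, the BLAST hit list is passed in as a ready-made list of accession numbers, and the function name `retrieve_go_terms` is hypothetical.

```python
from typing import Dict, List, Optional, Set


def retrieve_go_terms(
    accession: Optional[str],
    homolog_accessions: List[str],
    goa: Dict[str, Set[str]],
) -> Set[str]:
    """Return the first non-empty GO term set found for a query protein.

    accession          -- the query's own AC, or None if it has none
    homolog_accessions -- ACs of BLAST homologs, best hit first (assumed input)
    goa                -- toy stand-in for the GOA database: AC -> GO terms
    """
    # Try the query's own AC first, then the homologs one after another.
    candidates = ([accession] if accession else []) + list(homolog_accessions)
    for ac in candidates:
        terms = goa.get(ac, set())
        if terms:  # stop at the first AC that has at least one GO term
            return set(terms)
    return set()  # nothing found in the toy database


# Toy example: the top homolog H1 has no GOA entry, the second one does.
toy_goa = {"H2": {"GO:0005634"}, "H3": {"GO:0005737", "GO:0005739"}}
found = retrieve_go_terms(None, ["H1", "H2", "H3"], toy_goa)
```

In this toy run the search skips `H1` and stops at `H2`, mirroring the successive search over homologs.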

GO Frequency Features 

Figure 1. Procedures of retrieving GO terms. $\mathbb{Q}_i$: the $i$-th query protein; $k_{\max}$: the maximum number of homologs retrieved by BLAST with the
default parameter setting; $\mathcal{Q}_{i,k_i}$: the set of GO terms retrieved by BLAST using the $k_i$-th homolog for the $i$-th query protein $\mathbb{Q}_i$; $k_i$: the $k_i$-th homolog
used to retrieve the GO terms.
doi:10.1371/journal.pone.0089545.g001

Let $\mathcal{W}$ denote a set of distinct GO terms corresponding to a data
set. $\mathcal{W}$ is constructed in two steps: (1) identifying all of the GO
terms in the dataset and (2) removing the repetitive GO terms.
Suppose $W$ distinct GO terms are found, i.e., $|\mathcal{W}| = W$; these GO
terms form a GO Euclidean space with $W$ dimensions. For each
sequence in the dataset, a GO vector is constructed by matching
its GO terms against $\mathcal{W}$, using the number of occurrences of
individual GO terms in $\mathcal{W}$ as the coordinates. Specifically, the GO
vector $\mathbf{p}_i$ of the $i$-th protein $\mathbb{P}_i$ is defined as:

$$\mathbf{p}_i = [b_{i,1}, \ldots, b_{i,j}, \ldots, b_{i,W}]^{\mathsf{T}}, \qquad b_{i,j} = \begin{cases} f_{i,j}, & \text{GO hit} \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

where $f_{i,j}$ is the number of occurrences of the $j$-th GO term (term-
frequency) in the $i$-th protein sequence. The rationale is that the
term-frequencies contain important information for classification.
Note that the $b_{i,j}$'s are analogous to the term-frequencies commonly
used in document retrieval.

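Similarly, for the $t$-th query protein $\mathbb{Q}_t$, the GO frequency
vector is defined as:

$$\mathbf{q}_t^{F} = [b_{t,1}, \ldots, b_{t,j}, \ldots, b_{t,W}]^{\mathsf{T}}, \qquad b_{t,j} = \begin{cases} f_{t,j}, & \text{GO hit} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$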



In the following sections, we use the superscript F to denote the 
GO frequency features in Eq. 2. 
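
As an illustration of Eqs. 1 and 2, the following sketch builds the set of distinct GO terms from a toy training set and turns a list of retrieved GO terms into a frequency vector. The helper names and the toy annotations are assumptions made for this example; the sorted ordering of the vocabulary is an arbitrary but fixed choice.

```python
from collections import Counter
from typing import Dict, List


def build_vocabulary(go_lists: Dict[str, List[str]]) -> List[str]:
    """Collect the distinct GO terms of a dataset in a fixed (sorted) order."""
    return sorted({term for terms in go_lists.values() for term in terms})


def frequency_vector(terms: List[str], vocabulary: List[str]) -> List[int]:
    """Eq. 1 / Eq. 2: coordinate j is the term frequency if the j-th GO term
    is hit, and 0 otherwise."""
    counts = Counter(terms)
    return [counts.get(term, 0) for term in vocabulary]


# Toy training annotations (made up); a term may be retrieved more than once.
train = {
    "P1": ["GO:0005634", "GO:0005737", "GO:0005634"],
    "P2": ["GO:0005739"],
}
vocab = build_vocabulary(train)
p1 = frequency_vector(train["P1"], vocab)
# A query term that is absent from the vocabulary simply contributes nothing.
q = frequency_vector(["GO:0005737", "GO:0009507"], vocab)
```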

Semantic-Similarity Features 

Semantic similarity (SS) is a measure for quantifying the
similarity between categorical data (e.g., words in documents),
where the notion of similarity is based on the likeness of
meanings in the data. It was originally developed by Resnik [49] for
natural language processing. The idea is to evaluate semantic
similarity in an 'is-a' taxonomy using the shared information
contents of categorical data. In the context of gene ontology, the
semantic similarity between two GO terms is based on their most
specific common ancestor in the GO hierarchy. The relationships
between GO terms in the GO hierarchy, such as 'is-a' ancestor-child
or 'part-of' ancestor-child, can be obtained from the SQL
database through the link:
http://archive.geneontology.org/latest-termdb/go_daily-termdb-tables.tar.gz.
Note that only the 'is-a' relationship is considered here for semantic
similarity analysis [51]. Specifically, the semantic similarity between
two GO terms $x$ and $y$ is defined as [49]:

$$\mathrm{sim}(x,y) = \max_{c \in A(x,y)} \left[ -\log(p(c)) \right], \tag{3}$$

where $A(x,y)$ is the set of ancestor GO terms of both $x$ and $y$, and
$p(c)$ is the number of gene products annotated to
the GO term $c$ divided by the total number of gene products
annotated in the GO taxonomy.
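
A minimal sketch of Resnik's measure (Eq. 3) on a toy 'is-a' hierarchy is given below. The four-term hierarchy, the annotation counts and the convention that a term counts as its own ancestor are all assumptions of this example and are not taken from the GO database.

```python
import math
from typing import Dict, Set

# Toy 'is-a' hierarchy (child -> parents) and annotation counts; both made up.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "organelle": {"root"},
    "nucleus": {"organelle"},
    "mitochondrion": {"organelle"},
}
# Gene products annotated to each term or to any of its descendants.
COUNTS: Dict[str, int] = {
    "root": 100, "organelle": 40, "nucleus": 10, "mitochondrion": 20,
}


def ancestors(term: str) -> Set[str]:
    """Return the term itself plus all of its 'is-a' ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term: str) -> float:
    """-log p(c), with p(c) the fraction of gene products annotated to c."""
    return -math.log(COUNTS[term] / COUNTS["root"])


def resnik(x: str, y: str) -> float:
    """Eq. 3: largest information content among the common ancestors."""
    common = ancestors(x) & ancestors(y)
    return max(information_content(c) for c in common)
```

For the two sibling terms, the most informative common ancestor is `organelle`, so `resnik("nucleus", "mitochondrion")` equals -log(0.4).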

While Resnik's measure is effective in quantifying the shared 
information between two GO terms, it ignores the distance 
between the terms and their common ancestors in the GO 
hierarchy. To further incorporate structural information from
the GO hierarchy into the similarity measure, we have
explored three extensions of Resnik's measure, namely Lin's
measure [51], Jiang's measure [74], and relevance similarity
(RS) [52].

Given two GO terms $x$ and $y$, the similarity by Lin's measure is:

$$\mathrm{sim}_{\mathrm{Lin}}(x,y) = \mathrm{sim}_1(x,y) = \max_{c \in A(x,y)} \left( \frac{2 \cdot [-\log(p(c))]}{-\log(p(x)) - \log(p(y))} \right). \tag{4}$$

The similarity by Jiang's measure is:

$$\mathrm{sim}_{\mathrm{Jiang}}(x,y) = \mathrm{sim}_2(x,y) = \max_{c \in A(x,y)} \left( \frac{1}{1 - \log(p(x)) - \log(p(y)) + 2\log(p(c))} \right). \tag{5}$$

The similarity by RS is calculated as:

$$\mathrm{sim}_{\mathrm{RS}}(x,y) = \mathrm{sim}_3(x,y) = \max_{c \in A(x,y)} \left( \frac{2 \cdot [-\log(p(c))]}{-\log(p(x)) - \log(p(y))} \cdot (1 - p(c)) \right). \tag{6}$$

Among the three measures, $\mathrm{sim}_{\mathrm{Lin}}(x,y)$ and $\mathrm{sim}_{\mathrm{Jiang}}(x,y)$ are
relative measures that are proportional to the difference in
information content between the terms and their common
ancestors, which is independent of the absolute information
content of the ancestors. On the other hand, $\mathrm{sim}_{\mathrm{RS}}(x,y)$
incorporates the probability of annotating the common ancestors
as a weighting factor to Lin's measure. To simplify notations, we
refer to $\mathrm{sim}_{\mathrm{Lin}}(x,y)$, $\mathrm{sim}_{\mathrm{Jiang}}(x,y)$ and $\mathrm{sim}_{\mathrm{RS}}(x,y)$ as $\mathrm{sim}_1(x,y)$,
$\mathrm{sim}_2(x,y)$ and $\mathrm{sim}_3(x,y)$, respectively.

Based on the semantic similarity between two GO terms, we
adopted a continuous measure proposed in [48] to calculate the
similarity between two proteins. Specifically, given two proteins $\mathbb{P}_i$
and $\mathbb{P}_j$, we retrieved their corresponding GO terms $\mathcal{P}_i$ and $\mathcal{P}_j$ as
described in the subsection "Retrieval of GO Terms". (Note that
strictly speaking, $\mathcal{P}_i$ should be $\mathcal{P}_{i,k_i}$, where $k_i$ is the $k_i$-th homolog
used to retrieve the GO terms for the $i$-th protein. To simplify
notations, we write it as $\mathcal{P}_i$.) Then, we computed the semantic
similarity between two sets of GO terms $\{\mathcal{P}_i, \mathcal{P}_j\}$ as follows:

$$S_l(\mathcal{P}_i,\mathcal{P}_j) = \sum_{x \in \mathcal{P}_i} \max_{y \in \mathcal{P}_j} \mathrm{sim}_l(x,y), \tag{7}$$

where $l \in \{1,2,3\}$, and $\mathrm{sim}_l(x,y)$ is defined in Eq. 4 to Eq. 6.
$S_l(\mathcal{P}_j,\mathcal{P}_i)$ is computed in the same way by swapping $\mathcal{P}_i$ and $\mathcal{P}_j$.
Finally, the overall similarity between the two proteins is given by:

$$SS_l(\mathcal{P}_i,\mathcal{P}_j) = \frac{S_l(\mathcal{P}_i,\mathcal{P}_j) + S_l(\mathcal{P}_j,\mathcal{P}_i)}{S_l(\mathcal{P}_i,\mathcal{P}_i) + S_l(\mathcal{P}_j,\mathcal{P}_j)}, \tag{8}$$

where $l \in \{1,2,3\}$. In the sequel, we refer to the SS measures by Lin,
Jiang and RS as SS1, SS2 and SS3, respectively.

Thus, for a testing protein $\mathbb{Q}_t$ with GO term set $\mathcal{Q}_t$, a GO
semantic similarity (SS) vector $\mathbf{q}_t^{S_l}$ can be obtained by computing
the semantic similarity between $\mathcal{Q}_t$ and each of the training
proteins $\{\mathcal{P}_i\}_{i=1}^{N}$, where $N$ is the number of training proteins. Thus,
$\mathbb{Q}_t$ can be represented by an $N$-dimensional vector:

$$\mathbf{q}_t^{S_l} = [SS_l(\mathcal{Q}_t,\mathcal{P}_1), \ldots, SS_l(\mathcal{Q}_t,\mathcal{P}_i), \ldots, SS_l(\mathcal{Q}_t,\mathcal{P}_N)]^{\mathsf{T}}, \tag{9}$$

where $l \in \{1,2,3\}$. In other words, $\mathbf{q}_t^{S_l}$ represents the SS vector by
using the $l$-th SS measure.

Hybridization of Two GO Features

As can be seen from the subsections "GO Frequency Features"
and "Semantic-Similarity Features", the GO
frequency features (Eq. 2) use the frequency of occurrences of
GO terms, while GO SS features (Eq. 4 to Eq. 6) use the semantic
similarity between GO terms. These two features are developed
from two different perspectives. It is therefore reasonable to
believe that these two kinds of information complement each
other. Based on this assumption, we combine these two GO
features and form a hybridized vector as:

$$\mathbf{q}_t^{H_l} = \begin{bmatrix} \mathbf{q}_t^{F} \\ \mathbf{q}_t^{S_l} \end{bmatrix}, \tag{10}$$

where $l \in \{1,2,3\}$. In other words, $\mathbf{q}_t^{H_l}$ represents the hybridizing-
feature vector by combining the GO frequency features and the
SS features derived from the $l$-th SS measure. We refer to them as
Hybrid1, Hybrid2 and Hybrid3, respectively.
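
The chain from term-level similarity to the hybridized vector can be sketched as follows. Everything here is a hedged illustration: the probabilities `P` and ancestor sets `ANC` are made-up toy data, every term is treated as its own ancestor, the protein-level score is assumed to be the two directed sums normalised by the two self-similarity sums, and the zero-denominator guard in `lin` and `rs` is a choice of this sketch.

```python
import math
from typing import Callable, Dict, List, Sequence, Set

# Toy annotation probabilities p(c) and ancestor sets (made up).
P: Dict[str, float] = {"root": 1.0, "organelle": 0.4, "nucleus": 0.1, "mito": 0.2}
ANC: Dict[str, Set[str]] = {
    "root": {"root"},
    "organelle": {"organelle", "root"},
    "nucleus": {"nucleus", "organelle", "root"},
    "mito": {"mito", "organelle", "root"},
}
Sim = Callable[[str, str], float]


def ic(term: str) -> float:
    """Information content -log p(term)."""
    return -math.log(P[term])


def lin(x: str, y: str) -> float:
    """Lin's measure; 0.0 if neither term carries information (sketch choice)."""
    denom = ic(x) + ic(y)
    if denom == 0.0:
        return 0.0
    return max(2.0 * ic(c) / denom for c in ANC[x] & ANC[y])


def jiang(x: str, y: str) -> float:
    """Jiang's measure: 1 / (1 + IC(x) + IC(y) - 2 IC(c)), maximised over c."""
    return max(1.0 / (1.0 + ic(x) + ic(y) - 2.0 * ic(c)) for c in ANC[x] & ANC[y])


def rs(x: str, y: str) -> float:
    """Relevance similarity: Lin's ratio weighted by (1 - p(c))."""
    denom = ic(x) + ic(y)
    if denom == 0.0:
        return 0.0
    return max(2.0 * ic(c) / denom * (1.0 - P[c]) for c in ANC[x] & ANC[y])


def directed(a: Set[str], b: Set[str], sim: Sim) -> float:
    """Sum over the terms of a of their best match in b (Eq. 7)."""
    return sum(max(sim(x, y) for y in b) for x in a)


def protein_similarity(a: Set[str], b: Set[str], sim: Sim) -> float:
    """Symmetric protein-level score, normalised by the self-similarities."""
    numerator = directed(a, b, sim) + directed(b, a, sim)
    return numerator / (directed(a, a, sim) + directed(b, b, sim))


def hybrid_vector(freq: Sequence[float], query_terms: Set[str],
                  training_sets: List[Set[str]], sim: Sim) -> List[float]:
    """Stack the frequency vector on top of the SS vector (Eqs. 9 and 10)."""
    ss = [protein_similarity(query_terms, t, sim) for t in training_sets]
    return list(freq) + ss


a, b = {"nucleus", "organelle"}, {"mito"}
v = hybrid_vector([2, 1, 0], a, [a, b], lin)
```

The resulting vector has W + N coordinates (here 3 + 2), which is why the dimension of the hybrid features grows with the size of the training set.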

Multi-label Multi-class SVM Classification 

The hybridized-feature vectors obtained from the previous
subsection are used for training multi-label one-vs-rest support
vector machines (SVMs). Specifically, for an $M$-class problem
(here $M$ is the number of subcellular locations), $M$ independent
binary SVMs are trained, one for each class. Denote the hybrid
GO vector of the $t$-th query protein using the $l$-th SS measure as
$\mathbf{q}_t^{H_l}$. Given the $t$-th query protein $\mathbb{Q}_t$, the score of the $m$-th SVM
using the $l$-th SS measure is

$$s_{m,l}(\mathbb{Q}_t) = \sum_{r \in \mathcal{S}_{m,l}} \alpha_{m,r}\, y_{m,r}\, K(\mathbf{p}_r^{H_l}, \mathbf{q}_t^{H_l}) + b_m, \tag{11}$$

where $\mathbf{p}_r^{H_l}$ is the hybrid GO vector derived from the $r$-th training protein $\mathbb{P}_r$ (see Eq. 10),
$\mathcal{S}_{m,l}$ is the set of support vector indexes corresponding to the $m$-th
SVM, $\alpha_{m,r}$ are the Lagrange multipliers, $y_{m,r} \in \{-1,+1\}$ indicates
whether the $r$-th training protein belongs to the $m$-th class or not,
and $K(\cdot,\cdot)$ is a kernel function. Here, the linear kernel was used.
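
To illustrate Eq. 11 with a linear kernel, the sketch below evaluates the score of one binary SVM from its support vectors, multipliers, labels and bias, and then collects one score per class. The dictionary layout of a trained model and all numbers are hypothetical; in practice these quantities would come from an SVM training package.

```python
from typing import Dict, List, Sequence


def linear_kernel(u: Sequence[float], v: Sequence[float]) -> float:
    """Linear kernel K(u, v): the inner product of the two vectors."""
    return sum(a * b for a, b in zip(u, v))


def svm_score(support_vectors: List[Sequence[float]], alphas: Sequence[float],
              labels: Sequence[int], bias: float, query: Sequence[float]) -> float:
    """Eq. 11 with a linear kernel: sum_r alpha_r * y_r * K(p_r, q) + b."""
    total = sum(a * y * linear_kernel(p, query)
                for p, a, y in zip(support_vectors, alphas, labels))
    return total + bias


def all_scores(models: List[Dict], query: Sequence[float]) -> List[float]:
    """One-vs-rest: one independent binary SVM score per subcellular location."""
    return [svm_score(m["sv"], m["alpha"], m["y"], m["b"], query) for m in models]


# Toy model with two support vectors (all values made up).
toy_model = {"sv": [[1.0, 0.0], [0.0, 1.0]], "alpha": [0.5, 0.25], "y": [1, -1], "b": 0.1}
score = svm_score(toy_model["sv"], toy_model["alpha"], toy_model["y"],
                  toy_model["b"], [2.0, 4.0])
```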

Unlike the single-label problem where each protein has one
predicted label only, a multi-label protein could have more than
one predicted label. In this work, we compared two different
decision schemes for this multi-label problem. In the first scheme,
the predicted subcellular location(s) of the $t$-th query protein are
given by

$$\mathcal{M}^{*}(\mathbb{Q}_t) = \begin{cases} \bigcup_{m=1}^{M} \{ m : s_{m,l}(\mathbb{Q}_t) > 0 \}, & \text{when } \exists\, m \in \{1,\ldots,M\} \text{ s.t. } s_{m,l}(\mathbb{Q}_t) > 0; \\ \arg\max_{m=1}^{M} s_{m,l}(\mathbb{Q}_t), & \text{otherwise.} \end{cases} \tag{12}$$

The second scheme is an improved version of the first one in
that the decision threshold is dependent on the test protein.
Specifically, the predicted subcellular location(s) of the $t$-th query
protein are given by:

If $\exists\, s_{m,l}(\mathbb{Q}_t) > 0$,

$$\mathcal{M}(\mathbb{Q}_t) = \bigcup_{m=1}^{M} \{ m : s_{m,l}(\mathbb{Q}_t) \geq \min\{1.0, f(s_{\max,l}(\mathbb{Q}_t))\} \}; \tag{13}$$

otherwise,

$$\mathcal{M}(\mathbb{Q}_t) = \arg\max_{m=1}^{M} s_{m,l}(\mathbb{Q}_t). \tag{14}$$



In Eq. 13, $f(s_{\max,l}(\mathbb{Q}_t))$ is a function of $s_{\max,l}(\mathbb{Q}_t)$, where
$s_{\max,l}(\mathbb{Q}_t) = \max_{m=1}^{M} s_{m,l}(\mathbb{Q}_t)$. In this work, we used a linear
function as follows:

$$f(s_{\max,l}(\mathbb{Q}_t)) = \theta\, s_{\max,l}(\mathbb{Q}_t), \tag{15}$$

where $\theta \in [0.0, 1.0]$ is a hyper-parameter that can be optimized
through cross-validation.

In fact, besides SVMs, many other machine learning models,
such as hidden Markov models (HMMs) and neural networks
(NNs) [75,76], have been used in protein subcellular-localization
predictors. However, HMMs and NNs are not suitable for GO-
based predictors because of the high dimensionality of GO vectors.
The main reason is that under such conditions, HMMs and NNs
can be easily overtrained and thus lead to poor performance. On
the other hand, linear SVMs can handle high-dimensional
data well because even if the number of training samples is smaller than
the feature dimension, linear SVMs are still able to find an optimal
solution.

Materials and Performance Metrics

Datasets

In this paper, a virus dataset [35,37] and a plant dataset [36]
were used to evaluate the performance of the proposed predictor.
The virus and the plant datasets were created from Swiss-Prot 57.9
and 55.3, respectively. The virus dataset contains 207 viral
proteins distributed in 6 locations. Of the 207 viral proteins, 165
belong to one subcellular location, 39 to two locations, 3 to three
locations and none to four or more locations. This means that
about 20% of the proteins in the dataset are located in more than
one subcellular location. The plant dataset contains 978 plant
proteins distributed in 12 locations. Of the 978 plant proteins, 904
belong to one subcellular location, 71 to two locations, 3 to three
locations and none to four or more locations. The sequence
identity of both datasets was cut off at 25%.

The breakdown of these two datasets is shown in Figs. 2(a) and
2(b). Fig. 2(a) shows that the majority (68%) of viral proteins in the
virus dataset are located in host cytoplasm and host nucleus, while
proteins located in the rest of the subcellular locations together
account for only around one third. This means that this multi-label
dataset is imbalanced across the six subcellular locations. Similar
conclusions can be drawn from Fig. 2(b), where most of the plant
proteins exist in chloroplast, cytoplasm, nucleus and mitochondrion,
while proteins in the other 8 subcellular locations together account
for less than 30%. This imbalance makes the prediction
of these two multi-label datasets difficult. These two benchmark
datasets are downloadable from the hyperlinks in the HybridGO-
Loc server.

Performance Metrics

Compared to traditional single-label classification, multi-label
classification requires more complicated performance metrics to
better reflect the multi-label capabilities of classifiers. Conventional
single-label measures need to be modified to adapt to multi-label
classification. These measures include Accuracy, Precision, Recall,
F1-score (F1) and Hamming Loss (HL) [77,78]. Specifically, denote
$\mathcal{L}(\mathbb{Q}_i)$ and $\mathcal{M}(\mathbb{Q}_i)$ as the true label set and the predicted label set
for the $i$-th protein $\mathbb{Q}_i$ $(i = 1, \ldots, N)$, respectively. Here, $N = 207$
for the virus dataset and $N = 978$ for the plant dataset. Then the
five measurements are defined as follows:

$$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\mathcal{M}(\mathbb{Q}_i) \cap \mathcal{L}(\mathbb{Q}_i)|}{|\mathcal{M}(\mathbb{Q}_i) \cup \mathcal{L}(\mathbb{Q}_i)|} \tag{16}$$

$$\mathrm{Precision} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\mathcal{M}(\mathbb{Q}_i) \cap \mathcal{L}(\mathbb{Q}_i)|}{|\mathcal{M}(\mathbb{Q}_i)|} \tag{17}$$

$$\mathrm{Recall} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\mathcal{M}(\mathbb{Q}_i) \cap \mathcal{L}(\mathbb{Q}_i)|}{|\mathcal{L}(\mathbb{Q}_i)|} \tag{18}$$

$$\mathrm{F1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2\,|\mathcal{M}(\mathbb{Q}_i) \cap \mathcal{L}(\mathbb{Q}_i)|}{|\mathcal{M}(\mathbb{Q}_i)| + |\mathcal{L}(\mathbb{Q}_i)|} \tag{19}$$

$$\mathrm{HL} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\mathcal{M}(\mathbb{Q}_i) \cup \mathcal{L}(\mathbb{Q}_i)| - |\mathcal{M}(\mathbb{Q}_i) \cap \mathcal{L}(\mathbb{Q}_i)|}{M} \tag{20}$$

where $|\cdot|$ means counting the number of elements in the set therein,
and $\cap$ and $\cup$ represent the intersection and union of sets, respectively.

Accuracy, Precision, Recall and F1 indicate the classification
performance. The higher the measures, the better the prediction
performance. Among them, Accuracy is the most commonly used
criterion. F1-score is the harmonic mean of Precision and Recall, which
allows us to compare the performance of classification systems by
taking the trade-off between Precision and Recall into account. The
Hamming Loss (HL) [77,78] is different from other metrics. As can
be seen from Eq. 20, when all of the proteins are correctly
predicted, i.e., $|\mathcal{M}(\mathbb{Q}_i) \cup \mathcal{L}(\mathbb{Q}_i)| = |\mathcal{M}(\mathbb{Q}_i) \cap \mathcal{L}(\mathbb{Q}_i)|$ $(i = 1, \ldots, N)$,
then HL = 0; whereas, other metrics will be equal to 1. On the
other hand, when the predictions of all proteins are completely
wrong, i.e., $|\mathcal{M}(\mathbb{Q}_i) \cup \mathcal{L}(\mathbb{Q}_i)| = M$ and $|\mathcal{M}(\mathbb{Q}_i) \cap \mathcal{L}(\mathbb{Q}_i)| = 0$, then
HL = 1; whereas, other metrics will be equal to 0. Therefore, the
lower the HL, the better the prediction performance.
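
The five measures can be computed directly from the predicted and true label sets. The sketch below is one way to do so, assuming labels are represented as Python sets of class indices and that every protein has at least one predicted and one true label; the function name and the toy data are illustrative only.

```python
from typing import Dict, List, Set


def multilabel_metrics(predicted: List[Set[int]], true: List[Set[int]],
                       num_classes: int) -> Dict[str, float]:
    """Accuracy, Precision, Recall, F1 and Hamming Loss, averaged over proteins."""
    n = len(true)
    totals = {"accuracy": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0, "hl": 0.0}
    for m, l in zip(predicted, true):
        inter, union = len(m & l), len(m | l)
        totals["accuracy"] += inter / union
        totals["precision"] += inter / len(m)
        totals["recall"] += inter / len(l)
        totals["f1"] += 2 * inter / (len(m) + len(l))
        totals["hl"] += (union - inter) / num_classes
    return {name: value / n for name, value in totals.items()}


# Toy example with 4 classes: the second protein is over-predicted by one label.
result = multilabel_metrics(predicted=[{1}, {1, 2}], true=[{1}, {2}], num_classes=4)
```

In the toy example the extra label on the second protein lowers Accuracy, Precision and F1 and raises HL, while Recall stays at 1 because no true label is missed.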

Two additional measurements [37,39] are often used in multi-label
subcellular localization prediction. They are overall locative
accuracy (OLA) and overall actual accuracy (OAA). The former is
given by:

$$\mathrm{OLA} = \frac{1}{\sum_{i=1}^{N} |\mathcal{L}(\mathbb{Q}_i)|} \sum_{i=1}^{N} |\mathcal{M}(\mathbb{Q}_i) \cap \mathcal{L}(\mathbb{Q}_i)|, \tag{21}$$

and the overall actual accuracy (OAA) is:

$$\mathrm{OAA} = \frac{1}{N} \sum_{i=1}^{N} \Delta[\mathcal{M}(\mathbb{Q}_i), \mathcal{L}(\mathbb{Q}_i)], \tag{22}$$

where

$$\Delta[\mathcal{M}(\mathbb{Q}_i), \mathcal{L}(\mathbb{Q}_i)] = \begin{cases} 1, & \text{if } \mathcal{M}(\mathbb{Q}_i) = \mathcal{L}(\mathbb{Q}_i) \\ 0, & \text{otherwise.} \end{cases} \tag{23}$$

Figure 2. Breakdown of the (a) virus and (b) plant datasets. The number of proteins shown in each subcellular location represents the number
of 'locative proteins' [37,39]. For (a), there are 207 actual proteins and 252 locative proteins; for (b), there are 978 actual proteins and 1055 locative
proteins.

doi:10.1371/journal.pone.0089545.g002
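
The locative counts quoted in the Figure 2 caption follow from the per-protein location counts given in the Datasets subsection: a protein with k locations counts once as an actual protein and k times as a locative protein. The short check below reproduces the arithmetic; the dictionary layout and function names are chosen only for this example.

```python
from typing import Dict


def actual_count(breakdown: Dict[int, int]) -> int:
    """Number of distinct proteins; breakdown maps #locations -> #proteins."""
    return sum(breakdown.values())


def locative_count(breakdown: Dict[int, int]) -> int:
    """Each protein is counted once per subcellular location it belongs to."""
    return sum(k * n for k, n in breakdown.items())


virus = {1: 165, 2: 39, 3: 3}   # 165 + 2*39 + 3*3 locative proteins
plant = {1: 904, 2: 71, 3: 3}   # 904 + 2*71 + 3*3 locative proteins
```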



According to Eq. 21, a locative protein is considered to be
correctly predicted if any of the predicted labels matches any labels
in the true label set. On the other hand, Eq. 22 suggests that an
actual protein is considered to be correctly predicted only if all of
the predicted labels match those in the true label set exactly. For
example, for a protein that coexists in, say, three subcellular locations, if
only two of the three are correctly predicted, or the predicted
result contains a location not belonging to the three, the prediction
is considered to be incorrect. In other words, when and only when
all of the subcellular locations of a query protein are exactly
predicted without any overprediction or underprediction, can the
prediction be considered as correct. Therefore, OAA is a more
stringent measure as compared to OLA. OAA is also more objective
than OLA. This is because locative accuracy is liable to give biased
performance measures when the predictor tends to over-predict,
i.e., giving large $|\mathcal{M}(\mathbb{Q}_i)|$ for many $\mathbb{Q}_i$. In the extreme case, if
every protein is predicted to have all of the $M$ subcellular
locations, according to Eq. 21, the OLA is 100%. But obviously,
the predictions are wrong and meaningless. On the contrary, OAA
is 0% in this extreme case, which definitely reflects the real
performance.
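
The contrast between the two measures in the over-prediction case can be reproduced with a few lines of Python. The functions below follow Eqs. 21 to 23 under the assumption that label sets are Python sets of class indices; the toy labels are made up.

```python
from typing import List, Set


def ola(predicted: List[Set[int]], true: List[Set[int]]) -> float:
    """Overall locative accuracy: matched labels over all true (locative) labels."""
    matched = sum(len(m & l) for m, l in zip(predicted, true))
    return matched / sum(len(l) for l in true)


def oaa(predicted: List[Set[int]], true: List[Set[int]]) -> float:
    """Overall actual accuracy: fraction of proteins predicted exactly right."""
    exact = sum(1 for m, l in zip(predicted, true) if m == l)
    return exact / len(true)


true_labels = [{0}, {1, 2}, {3}]          # 3 actual proteins, 4 locative proteins
predict_everything = [{0, 1, 2, 3}] * 3   # extreme over-prediction with M = 4
partly_right = [{0}, {1}, {3}]            # second protein is under-predicted
```

Predicting every location for every protein gives an OLA of 1.0 but an OAA of 0.0, whereas the partly correct prediction gives an OLA of 0.75 and an OAA of 2/3.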

Among all the metrics mentioned above, OAA is the most 
stringent and objective. This is because if only some (but not all) of 
the subcellular locations of a query protein are correctly predicted,
the numerators of the other 4 measures (Eqs. 16 to 21) are non-zero,
whereas the numerator of OAA in Eq. 22 is 0 (thus contributing
nothing to the frequency count).

In statistical prediction, there are three methods that are often
used for testing the generalization capabilities of predictors:
independent tests, sub-sampling tests (or K-fold cross-validation)
and leave-one-out cross validation (LOOCV). For independent
tests, the selection of the independent dataset often bears some sort of
arbitrariness [79]; for K-fold cross validation, different
partitionings of a dataset will lead to different results, so it is still
liable to statistical arbitrariness; LOOCV, in contrast, yields a
unique outcome and is considered to be the most rigorous and
bias-free method [80]. Hence, LOOCV was used to examine the
performance of all predictors in this work. More detailed analysis
of the statistical methods can be found in the supplementary
materials. Note that the jackknife cross validation in iLoc-Plant
and its variants is the same as LOOCV, as mentioned in [36,79].
Because the term jackknife also refers to the methods that estimate
the bias and variance of an estimator [81], to avoid confusion, we
only use the term LOOCV in this paper.
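
For readers unfamiliar with LOOCV, the loop below shows the general idea: each protein is held out once, a predictor is built from the remaining proteins, and the held-out protein is then predicted. The `fit_predict` callable and the one-dimensional nearest-neighbour toy predictor are placeholders for this example and do not represent the predictor proposed in this paper.

```python
from typing import Any, Callable, List, Set

FitPredict = Callable[[List[Any], List[Set[int]], Any], Set[int]]


def leave_one_out(samples: List[Any], labels: List[Set[int]],
                  fit_predict: FitPredict) -> List[Set[int]]:
    """Hold out each sample in turn and predict it from all the others."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(fit_predict(train_x, train_y, samples[i]))
    return predictions


def nearest_neighbour(train_x: List[float], train_y: List[Set[int]],
                      test_x: float) -> Set[int]:
    """Toy predictor: copy the label set of the closest training sample."""
    best = min(range(len(train_x)), key=lambda j: abs(train_x[j] - test_x))
    return train_y[best]


toy_x = [0.0, 0.1, 1.0, 1.1]
toy_y = [{0}, {0}, {1}, {1}]
loo_predictions = leave_one_out(toy_x, toy_y, nearest_neighbour)
```

Because every sample is held out exactly once, the outcome does not depend on any random partitioning, which is the uniqueness property mentioned above.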

Results 

Comparing Different Features 

Fig. 3(a) shows the performance of individual and hybridized
GO features on the virus dataset based on leave-one-out cross
validation (LOOCV). In the figure, SS1, SS2 and SS3 represent
Lin's, Jiang's and RS similarity measures, respectively. Hybrid1,
Hybrid2 and Hybrid3 represent the hybridized features obtained
from these measures. As can be seen, in terms of all the six
performance metrics, the performance of the hybrid features is
remarkably better than the performance of individual features,
regardless of which of the GO frequency features or the three GO
SS features were used. Specifically, the OAAs (the most stringent
and objective metric) of all of the three hybrid features are at least
3% (absolute) higher than that of the individual features, which
suggests that hybridizing the two features can significantly boost
the prediction performance. Moreover, among the hybridized
features, Hybrid2, namely combining GO
frequency features and GO SS features by Jiang's measure,
outperforms Hybrid1 and Hybrid3. Another interesting thing is that
although all of the individual GO SS features perform much worse
than the GO frequency features, the performance of the three
hybridized features is still better than that of any of the individual
features. This suggests that the GO frequency features and SS
features are complementary to each other.

Similar conclusions can be drawn from the plant dataset shown
in Fig. 3(b). However, comparison between Fig. 3(a) and Fig. 3(b)
reveals that for the plant dataset, the hybridized
features outperform all of the individual features in terms of all
metrics except OLA and Recall, while for the virus dataset, the
former are superior to the latter in terms of all metrics. Nevertheless, the
losses in these two metrics do not outweigh the significant
improvement on other metrics, especially on OAA, which has
around 3% (absolute) improvement in terms of hybridized features
as opposed to using individual features. Among the hybridized
features, Hybrid2 also outperforms Hybrid1 and Hybrid3 in terms of
OLA, Accuracy, Recall and F1-score, whereas Hybrid1 performs better
than others in terms of OAA and Precision. These results
demonstrate that the GO SS features obtained by Lin's measure
and Jiang's measure are better candidates than the RS measure for
combining with the GO frequency features; however, there is no
evidence suggesting which measure is better. It is also interesting to
see that the performance of the three individual GO SS features is
better than that of GO frequency features, contrary to the
results shown in Fig. 3(a).

Comparing with State-of-the-Art Predictors 

Table 1 and Table 2 compare the performance of the proposed
predictor against several state-of-the-art multi-label predictors on
the virus and plant datasets based on leave-one-out cross validation.
Note that we used the best performing hybridized features with
the adaptive decision strategy. Specifically, for both the virus and
plant datasets, the best performance was achieved when Hybrid2
and the adaptive decision strategy with θ = 0.3 were used. θ was
determined by cross-validation as stated previously. Unless stated
otherwise, we used Hybrid2 to represent HybridGO-Loc in
subsequent experiments. Our proposed predictor uses the GO
frequency features and GO semantic similarity features, whereas
other predictors use only the GO frequency of occurrences as
features. From the classification perspective, Virus-mPLoc [35]
uses an ensemble OET-KNN (optimized evidence-theoretic K-
nearest neighbors) classifier; iLoc-Virus [37] uses a multi-label
KNN classifier; KNN-SVM [38] uses an ensemble of classifiers
combining KNN and SVM; mGOASVM [39] uses a multi-label
SVM classifier; and the proposed predictor uses a multi-label SVM
classifier incorporated with the adaptive decision scheme. 
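
To illustrate the adaptive decision strategy mentioned above, the following minimal Python sketch assumes that the multi-label SVM produces one real-valued score per subcellular location and that a location is selected when its score reaches min(1, θ·s_max), where s_max is the largest score. If no score is positive, only the top-scoring location is returned. The function name, the score dictionary and this exact thresholding form are assumptions made for illustration and should be checked against the definition given earlier in the paper.

```python
from typing import Dict, Set


def adaptive_decision(scores: Dict[str, float], theta: float = 0.3) -> Set[str]:
    """Turn per-location SVM scores into a set of predicted locations.

    Assumed rule (illustrative only): keep every location whose score is
    at least min(1.0, theta * s_max); if no score is positive, keep only
    the single best-scoring location.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta is assumed to lie in [0, 1]")

    best_label = max(scores, key=scores.get)
    s_max = scores[best_label]
    if s_max <= 0.0:
        # No SVM fires: fall back to a single-label decision.
        return {best_label}

    threshold = min(1.0, theta * s_max)
    return {label for label, s in scores.items() if s >= threshold}


if __name__ == "__main__":
    example = {"Host nucleus": 1.2, "Host cytoplasm": 0.5, "Secreted": -0.8}
    print(adaptive_decision(example, theta=0.3))
```

Under this assumed rule, a small θ admits more locations per protein (favouring Recall and OLA), whereas θ = 1 keeps only locations scoring at least min(1, s_max), which makes the decision stricter (favouring Precision and OAA).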

As shown in Table 1, the proposed predictor performs 
significantly better than the other predictors. The OAA and OLA 
of the proposed predictor are more than 15% (absolute) higher 
than those of iLoc-Virus and Virus-mPLoc. It also performs 
significantly better than KNN-SVM in terms of OLA. Compared 
with mGOASVM, the proposed predictor performs 
remarkably better in terms of all of the performance metrics, 
especially OAA (0.937 vs 0.889). These results demonstrate that 
hybridizing the GO frequency features and GO SS features can 
significantly boost prediction performance, which also suggests 
that these two kinds of information are complementary 
to each other in predicting subcellular localization. 
Similar conclusions can be drawn for the plant dataset from 
Table 2, except that the OLA of the proposed predictor is slightly 
worse than that of mGOASVM and the Recall is equal to that 
of mGOASVM. Nevertheless, the small loss does not outweigh the 
impressive improvement in the other metrics, especially in OAA 
(0.936 vs 0.874). 
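
The metrics compared above can be computed from the true and predicted label sets. The sketch below uses the common set-based definitions for multi-label classification and assumes that OLA counts correctly predicted (protein, location) pairs over all true pairs, while OAA counts proteins whose predicted label set matches the true set exactly. This is consistent with the denominators in Table 1 (252 locative proteins for OLA, 207 actual proteins for OAA), but the function name and the exact formulas are assumptions rather than a restatement of the paper's equations.

```python
from typing import Dict, List, Set


def multilabel_metrics(true_sets: List[Set[str]],
                       pred_sets: List[Set[str]],
                       n_classes: int) -> Dict[str, float]:
    """Set-based multi-label metrics (assumed definitions).

    Every true and predicted label set is assumed to be non-empty.
    """
    if len(true_sets) != len(pred_sets) or not true_sets:
        raise ValueError("need two non-empty lists of equal length")

    n = len(true_sets)
    pairs = list(zip(true_sets, pred_sets))
    hits = [len(t & p) for t, p in pairs]  # correctly predicted labels

    return {
        # Correct (protein, location) pairs over all true pairs.
        "OLA": sum(hits) / sum(len(t) for t in true_sets),
        # Proteins whose predicted set equals the true set exactly.
        "OAA": sum(1 for t, p in pairs if t == p) / n,
        "Accuracy": sum(h / len(t | p) for h, (t, p) in zip(hits, pairs)) / n,
        "Precision": sum(h / len(p) for h, (_, p) in zip(hits, pairs)) / n,
        "Recall": sum(h / len(t) for h, (t, _) in zip(hits, pairs)) / n,
        "F1": sum(2 * h / (len(t) + len(p))
                  for h, (t, p) in zip(hits, pairs)) / n,
        # Hamming loss: wrongly assigned or missed labels per class.
        "HL": sum(len(t ^ p) for t, p in pairs) / (n * n_classes),
    }


if __name__ == "__main__":
    truth = [{"a"}, {"a", "b"}, {"c"}]
    preds = [{"a"}, {"a"}, {"b", "c"}]
    for name, value in multilabel_metrics(truth, preds, n_classes=3).items():
        print(f"{name}: {value:.3f}")
```

Because OAA requires an exact match of the whole label set, it is stricter than OLA, which is why a predictor can lose slightly in OLA or Recall while gaining in OAA.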

Prediction of Novel Proteins 

To further demonstrate the effectiveness of HybridGO-Loc, we 
compared it with state-of-the-art multi-label predictors in 
independent tests on a newer plant dataset constructed for 
mGOASVM [39]. Specifically, this new plant dataset contains 175 
plant proteins, of which 147 belong to one subcellular location, 27 
belong to two locations, 1 belongs to three locations and none belongs 
to four or more locations. These plant proteins were added to 
Swiss-Prot between 08-Mar-2011 and 18-Apr-2012. Because the plant 
dataset used for training the predictors was created on 29-Apr-2008, 
there is an almost 3-year time gap between the training data 
and the test data in our experiments. 

Table 3 compares the performance of HybridGO-Loc against 
several state-of-the-art multi-label plant predictors on the new 
plant dataset. All the predictors use the 978 proteins of the plant 
dataset (see Fig. 2(b)) for training the classifier and are tested 
independently on the 175 new proteins. As can be seen, 
HybridGO-Loc performs significantly better than all the other 
predictors in terms of all of the overall performance metrics. Similar 
conclusions can be drawn for most of the individual 
subcellular locations. 

Fig. 4 shows the distribution of the E-values of the test proteins, 
which were obtained by using the training proteins as the 
repository and the test proteins as the query proteins in the 
BLAST search. If we use the common criterion that homologous 
proteins should have an E-value less than 10⁻⁴, then 74 out of 175 
test proteins are homologs of the training proteins, which accounts 
for 42% of the test set. Note that this homologous relationship does 
not mean that BLAST's homology transfer can predict all of 
the 74 test proteins correctly. In fact, BLAST's homology transfer 
(based on the CC field of the homologous proteins) can only 
achieve a prediction accuracy of 26.9% (47/175). As the 




Figure 3. Performance of the hybrid features and individual features on the (a) virus and (b) plant datasets. Freq: GO frequency 
features; SS1, SS2 and SS3: GO semantic similarity features by using Lin's measure [51], Jiang's measure [74] and RS measure [52], respectively; Hybrid1, 
Hybrid2 and Hybrid3: GO hybrid features by combining GO frequency features with GO semantic similarity features based on SS1, SS2 and SS3, 
respectively. 

doi:10.1371/journal.pone.0089545.g003 



PLOS ONE | www.plosone.org | March 2014 | Volume 9 | Issue 3 | e89545 



Table 1. Comparing the proposed predictor with state-of-the-art multi-label predictors based on leave-one-out cross validation 
(LOOCV) using the virus dataset. 

Entries in the predictor columns are LOOCV Locative Accuracies (LA). 

| Label | Subcellular Location | Virus-mPLoc [35] | KNN-SVM [38] | iLoc-Virus [37] | mGOASVM [39] | HybridGO-Loc |
|---|---|---|---|---|---|---|
| 1 | Viral capsid | 8/8 = 1.000 | 8/8 = 1.000 | 8/8 = 1.000 | 8/8 = 1.000 | 8/8 = 1.000 |
| 2 | Host cell membrane | 19/33 = 0.576 | 27/33 = 0.818 | 25/33 = 0.758 | 32/33 = 0.970 | 32/33 = 0.970 |
| 3 | Host ER | 13/20 = 0.650 | 15/20 = 0.750 | 15/20 = 0.750 | 17/20 = 0.850 | 18/20 = 0.900 |
| 4 | Host cytoplasm | 52/87 = 0.598 | 86/87 = 0.988 | 64/87 = 0.736 | 85/87 = 0.977 | 84/87 = 0.966 |
| 5 | Host nucleus | 51/84 = 0.607 | 54/84 = 0.651 | 70/84 = 0.833 | 82/84 = 0.976 | 83/84 = 0.988 |
| 6 | Secreted | 9/20 = 0.450 | 13/20 = 0.650 | 15/20 = 0.750 | 20/20 = 1.000 | 20/20 = 1.000 |
| | Overall Locative Accuracy (OLA) | 152/252 = 0.603 | 203/252 = 0.807 | 197/252 = 0.782 | 244/252 = 0.968 | 245/252 = 0.972 |
| | Overall Actual Accuracy (OAA) | - | - | 155/207 = 0.748 | 184/207 = 0.889 | 194/207 = 0.937 |
| | Accuracy | - | - | - | 0.935 | 0.961 |
| | Precision | - | - | - | 0.939 | 0.965 |
| | Recall | - | - | - | 0.973 | 0.976 |
| | F1 | - | - | - | 0.950 | 0.968 |
| | HL | - | - | - | 0.026 | 0.016 |

"-" means the corresponding references do not provide the results on the respective metrics. Host ER: Host endoplasmic reticulum. 
doi:10.1371/journal.pone.0089545.t001 



prediction accuracy of HybridGO-Loc on this test set (see Table 3) 
is significantly higher than this percentage, the extra information 
available from the GOA database plays a very important role in 
the prediction. 
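
The homolog count discussed above can be reproduced with a few lines of Python. The sketch assumes that, for each test protein, the smallest BLAST E-value against the training proteins has already been extracted (with None when BLAST returns no hit); the function name and the input format are hypothetical.

```python
from typing import Iterable, Optional


def count_homologs(best_evalues: Iterable[Optional[float]],
                   threshold: float = 1e-4) -> int:
    """Count test proteins whose best E-value is below the threshold.

    best_evalues holds one entry per test protein: the smallest E-value
    found among the training proteins, or None if there was no hit.
    """
    return sum(1 for e in best_evalues if e is not None and e < threshold)


if __name__ == "__main__":
    # Toy E-values for five hypothetical test proteins.
    toy = [1e-30, 0.5, None, 1e-5, 1e-4]
    print(count_homologs(toy))           # proteins counted as homologs

    # Percentages quoted in the text, recomputed from the raw counts.
    print(round(100 * 74 / 175))         # share of test proteins with homologs
    print(round(100 * 47 / 175, 1))      # accuracy of BLAST homology transfer
```

The comparison is strict, so a protein whose best E-value equals the threshold is not counted as a homolog.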



Discussion 

Semantic Similarity Measures 

In this paper, we have compared three of the most common 
semantic similarity measures for subcellular localization, including 



Table 2. Comparing the proposed predictor with state-of-the-art multi-label predictors based on leave-one-out cross validation 
(LOOCV) using the plant dataset. 

Entries in the predictor columns are LOOCV Locative Accuracies (LA). 

| Label | Subcellular Location | Plant-mPLoc [34] | iLoc-Plant [36] | mGOASVM [39] | HybridGO-Loc |
|---|---|---|---|---|---|
| 1 | Cell membrane | 24/56 = 0.429 | 39/56 = 0.696 | 53/56 = 0.946 | 51/56 = 0.911 |
| 2 | Cell wall | 8/32 = 0.250 | 19/32 = 0.594 | 27/32 = 0.844 | 28/32 = 0.875 |
| 3 | Chloroplast | 248/286 = 0.867 | 252/286 = 0.881 | 272/286 = 0.951 | 278/286 = 0.972 |
| 4 | Cytoplasm | 72/182 = 0.396 | 114/182 = 0.626 | 174/182 = 0.956 | 168/182 = 0.923 |
| 5 | Endoplasmic reticulum | 17/42 = 0.405 | 21/42 = 0.500 | 38/42 = 0.905 | 38/42 = 0.905 |
| 6 | Extracellular | 3/22 = 0.136 | 2/22 = 0.091 | 22/22 = 1.000 | 21/22 = 0.955 |
| 7 | Golgi apparatus | 6/21 = 0.286 | 16/21 = 0.762 | 19/21 = 0.905 | 19/21 = 0.905 |
| 8 | Mitochondrion | 114/150 = 0.760 | 112/150 = 0.747 | 150/150 = 1.000 | 149/150 = 0.993 |
| 9 | Nucleus | 136/152 = 0.895 | 140/152 = 0.921 | 151/152 = 0.993 | 150/152 = 0.987 |
| 10 | Peroxisome | 14/21 = 0.667 | 6/21 = 0.286 | 21/21 = 1.000 | 21/21 = 1.000 |
| 11 | Plastid | 4/39 = 0.103 | 7/39 = 0.179 | 39/39 = 1.000 | 38/39 = 0.974 |
| 12 | Vacuole | 26/52 = 0.500 | 28/52 = 0.538 | 49/52 = 0.942 | 48/52 = 0.923 |
| | Overall Locative Accuracy (OLA) | 672/1055 = 0.637 | 756/1055 = 0.717 | 1015/1055 = 0.962 | 1009/1055 = 0.956 |
| | Overall Actual Accuracy (OAA) | - | 666/978 = 0.681 | 855/978 = 0.874 | 915/978 = 0.936 |
| | Accuracy | - | - | 0.926 | 0.959 |
| | Precision | - | - | 0.933 | 0.972 |
| | Recall | - | - | 0.968 | 0.968 |
| | F1 | - | - | 0.942 | 0.966 |
| | HL | - | - | 0.013 | 0.007 |

"-" means the corresponding references do not provide the results on the respective metrics. 

Table 3. Comparing HybridGO-Loc with state-of-the-art multi-label plant predictors based on independent tests using the new 
plant dataset. 

Entries in the predictor columns are Independent Test Locative Accuracies. 

| Label | Subcellular Location | Plant-mPLoc [34] | iLoc-Plant [36] | mGOASVM [39] | HybridGO-Loc |
|---|---|---|---|---|---|
| 1 | Cell membrane | 8/16 = 0.500 | 1/16 = 0.063 | 7/16 = 0.438 | 16/16 = 1.000 |
| 2 | Cell wall | 0/1 = 0 | 0/1 = 0 | 0/1 = 0 | 1/1 = 1.000 |
| 3 | Chloroplast | 27/54 = 0.500 | 45/54 = 0.833 | 39/54 = 0.722 | 30/54 = 0.556 |
| 4 | Cytoplasm | 5/38 = 0.132 | 15/38 = 0.395 | 19/38 = 0.500 | 31/38 = 0.816 |
| 5 | Endoplasmic reticulum | 1/9 = 0.111 | 1/9 = 0.111 | 3/9 = 0.333 | 4/9 = 0.444 |
| 6 | Extracellular | 0/3 = 0 | 0/3 = 0 | 1/3 = 0.333 | 0/3 = 0 |
| 7 | Golgi apparatus | 3/7 = 0.429 | 1/7 = 0.143 | 3/7 = 0.429 | 7/7 = 1.000 |
| 8 | Mitochondrion | 6/16 = 0.375 | 3/16 = 0.188 | 11/16 = 0.688 | 16/16 = 1.000 |
| 9 | Nucleus | 31/46 = 0.674 | 43/46 = 0.935 | 33/46 = 0.717 | 44/46 = 0.957 |
| 10 | Peroxisome | 4/6 = 0.667 | 0/6 = 0 | 3/6 = 0.500 | 4/6 = 0.667 |
| 11 | Plastid | 0/1 = 0 | 0/1 = 0 | 0/1 = 0 | 0/1 = 0 |
| 12 | Vacuole | 2/7 = 0.286 | 4/7 = 0.571 | 4/7 = 0.571 | 7/7 = 1.000 |
| | Overall Locative Accuracy (OLA) | 87/204 = 0.427 | 113/204 = 0.554 | 123/204 = 0.603 | 160/204 = 0.784 |
| | Overall Actual Accuracy (OAA) | 60/175 = 0.343 | 91/175 = 0.520 | 97/175 = 0.554 | 127/175 = 0.726 |
| | Accuracy | 0.417 | 0.574 | 0.594 | 0.784 |
| | Precision | 0.444 | 0.626 | 0.630 | 0.826 |
| | Recall | 0.474 | 0.577 | 0.609 | 0.798 |
| | F1 | 0.444 | 0.592 | 0.611 | 0.803 |
| | HL | 0.116 | 0.076 | 0.075 | 0.037 |

doi:10.1371/journal.pone.0089545.t003 



• B1) Inter-term relationship. SS vectors are based on inter-term 
relationships. They are defined on a space in which 
each basis corresponds to one training protein and the 
coordinate along that basis is defined by the semantic 
similarity between a testing protein and the 
corresponding training protein. 

• B2) Inter-group relationship. The pairwise relationships 
between a test protein and the training proteins are 
hierarchically structured. This is because each basis of 
the SS space depends on a group of GO terms of the 
corresponding training protein, and the terms are 
arranged in a hierarchical structure (parent-child 
relationship). Because the GO terms in different groups 
are not mutually exclusive, the bases in the SS space are 
not independent of each other. 

Bias Analysis 

Except for the new plant dataset, we adopted LOOCV, which is 
considered to be the most rigorous and bias-free evaluation 
method [80], to examine the performance of all predictors in this 
work. Nevertheless, determining the set of distinct GO terms W from 
a dataset is by no means without bias, and this bias may favor the 
LOOCV performance. This is because the set of distinct GO terms W 
derived from a given dataset may not be representative of other 
datasets; in other words, the generalization capabilities of the 
predictors may be weakened when new GO terms outside W are 
found in the test proteins. 

However, we have the following strategies to minimize the bias. 
First, the two benchmark datasets used in this paper were 
constructed based on the whole Swiss-Prot database (although in 
different years), which, to some extent, incorporated all the 



Lin's measure [51], Jiang's measure [74], and relevance similarity 
measure [52]. We excluded Resnik's measure because it ignores 
the distance between the terms and their common ancestors in the 
GO hierarchy. In addition to these measures, many online tools 
are also available for computing the semantic similarity at the GO- 
term level and gene-product level [44,82-84]. However, these 
measures are discrete measures whereas the measures that we used 
are continuous. Research has shown that continuous measures are 
better than discrete measures in many applications [48]. 
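
For concreteness, the three term-level measures can be sketched in Python using their commonly cited textbook forms, where p(c) is the probability of GO term c and the maximum is taken over the common ancestors c of terms x and y: Lin, 2 log p(c) / (log p(x) + log p(y)); Jiang, 1 / (1 - log p(x) - log p(y) + 2 log p(c)); and relevance similarity (RS), Lin's value multiplied by (1 - p(c)). These forms, the toy ontology, the toy probabilities and all function names below are assumptions for illustration and may differ in detail from the equations used in this paper.

```python
import math
from typing import Dict, List, Set


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return the term itself plus all of its ancestors in the GO graph."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_similarity(x: str, y: str, parents: Dict[str, List[str]],
                    p: Dict[str, float], measure: str = "lin") -> float:
    """Semantic similarity of two GO terms: 'lin', 'jiang' or 'rs'."""
    if measure not in ("lin", "jiang", "rs"):
        raise ValueError("measure must be 'lin', 'jiang' or 'rs'")
    log_x, log_y = math.log(p[x]), math.log(p[y])
    best = 0.0
    for c in ancestors(x, parents) & ancestors(y, parents):
        log_c = math.log(p[c])
        if measure == "jiang":
            score = 1.0 / (1.0 - log_x - log_y + 2.0 * log_c)
        else:
            denom = log_x + log_y
            # denom is zero only when both terms have probability 1 (root).
            score = 1.0 if denom == 0.0 else 2.0 * log_c / denom
            if measure == "rs":
                score *= 1.0 - p[c]
        best = max(best, score)
    return best


if __name__ == "__main__":
    # Toy ontology: root -> A -> {B, C}; probabilities are made up.
    toy_parents = {"A": ["root"], "B": ["A"], "C": ["A"]}
    toy_p = {"root": 1.0, "A": 0.5, "B": 0.25, "C": 0.25}
    for name in ("lin", "jiang", "rs"):
        print(name, round(term_similarity("B", "C", toy_parents, toy_p, name), 4))
```

All three measures return continuous values that depend on how informative the shared ancestor is relative to the two terms, which is the property contrasted above with discrete measures.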

GO-Frequency Features versus SS Features 

Note that we do not replace the GO frequency vectors. Instead, 
we augment the GO frequency feature with a more sophisticated 
feature, i.e., the GO SS vectors, which are combined with the 
GO frequency vectors. A GO frequency vector is found by 
counting the number of occurrences of every GO term in a set of 
distinct GO terms obtained from the training dataset, whereas an 
SS vector is constructed by computing the semantic similarity 
between a test protein and each of the training proteins at the 
gene-product level. That is, each element in an SS vector 
represents the semantic similarity of two GO-term groups. This 
can be easily seen from their definitions in Eq. 2 and Eqs. 4-9, 
respectively. 
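
The difference between the two kinds of vectors can be illustrated with a short Python sketch. It assumes that the gene-product-level similarity between two GO-term groups is the symmetric, normalized best-match sum (S(Gi,Gj) + S(Gj,Gi)) / (S(Gi,Gi) + S(Gj,Gj)), where S(Gi,Gj) sums, over the terms x in Gi, the largest term-level similarity between x and any term in Gj, and that the hybrid vector is the concatenation of the two vectors. Both choices, as well as all function names and the toy exact-match similarity, are assumptions for illustration; the authoritative definitions are the equations cited above.

```python
from collections import Counter
from typing import Callable, List, Sequence, Set

TermSim = Callable[[str, str], float]


def frequency_vector(go_terms: Sequence[str],
                     vocabulary: Sequence[str]) -> List[float]:
    """Count occurrences of each vocabulary term; other terms are ignored."""
    counts = Counter(go_terms)
    return [float(counts[t]) for t in vocabulary]


def group_similarity(g_i: Set[str], g_j: Set[str], term_sim: TermSim) -> float:
    """Similarity of two non-empty GO-term groups (assumed normalization)."""
    def directed(a: Set[str], b: Set[str]) -> float:
        return sum(max(term_sim(x, y) for y in b) for x in a)

    return ((directed(g_i, g_j) + directed(g_j, g_i))
            / (directed(g_i, g_i) + directed(g_j, g_j)))


def ss_vector(query: Set[str], training_groups: Sequence[Set[str]],
              term_sim: TermSim) -> List[float]:
    """One coordinate per training protein."""
    return [group_similarity(query, g, term_sim) for g in training_groups]


def hybrid_vector(freq: Sequence[float], ss: Sequence[float]) -> List[float]:
    """Concatenate the frequency and SS vectors (assumed fusion rule)."""
    return list(freq) + list(ss)


if __name__ == "__main__":
    def exact_match(x: str, y: str) -> float:
        """Toy term-level similarity: 1 for identical terms, else 0."""
        return 1.0 if x == y else 0.0

    vocab = ["a", "b", "c"]                       # distinct GO terms (toy)
    training = [{"a", "b"}, {"a", "c"}, {"d"}]    # GO-term groups (toy)
    freq = frequency_vector(["a", "a", "b", "z"], vocab)
    ss = ss_vector({"a", "b"}, training, exact_match)
    print(hybrid_vector(freq, ss))
```

The frequency part has one dimension per distinct GO term, whereas the SS part has one dimension per training protein, so the two parts describe a protein in different spaces before they are fused.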

The GO frequency vectors and the GO SS vectors are different 
in two fundamental ways. 

• A) GO frequency vectors are more primitive in the sense that 
their elements are based on individual GO terms without 
considering the inter-term relationships, i.e., the elements in a 
GO frequency vector are independent of each other. 

• B) GO SS vectors are more sophisticated in the following two 
respects (B1 and B2). 

Figure 4. Distribution of the closeness between the new testing proteins and the training proteins. The closeness is defined as the BLAST 
E-values of the training proteins using the test proteins as the query proteins in the BLAST searches. Number of Proteins: the number of testing 
proteins whose E-values fall into the interval specified under the bar. Small E-values suggest that the corresponding new proteins are close homologs 
of the training proteins. 
doi:10.1371/journal.pone.0089545.g004 



possible information of plant proteins or virus proteins in the 
database. In other words, W was constructed based on all of the 
GO terms corresponding to the whole Swiss-Prot database, which 
enables W to be representative of all of the distinct GO terms. 
Second, these two benchmark datasets were collected according to 
strict criteria. Details of the procedures can be found in the 
supplementary materials. The sequence similarity of both 
datasets was cut off at 25%, which enables us to use a small set of 
representative proteins to represent all of the proteins of the 
corresponding species (i.e., virus or plant) in the whole database. In 
other words, W will vary from species to species, yet still be 
statistically representative of all of the useful GO terms for the 
corresponding species. Third, using W for statistical performance 
evaluation is equivalent, or at least approximately equivalent, to 
using all of the distinct GO terms in the GOA database. This is 
because other GO terms that do not correspond to the training 
proteins will not participate in training the linear SVMs, nor will 
they play essential roles in contributing to the final predictions. In 
other words, the generalization capabilities of HybridGO-Loc will 
not be weakened even if some new GO terms are found in the test 
proteins. A mathematical proof of this statement can be found in the 
supplementary materials available on the HybridGO-Loc server. 

One may argue that a performance bias might arise when the 
whole W is used to construct the hybrid GO vectors for both 
training and testing during cross validation. This is because, in 
each fold of the LOOCV, the training proteins and the singled-out 
test protein use the same W to construct the GO vectors, 
meaning that the SVM training algorithm can see some 
information of the test protein indirectly through the GO vector 
space defined by W. It is possible that for a particular fold of 
LOOCV, the GO terms of a test protein do not exist in any of the 
training proteins. However, we have mathematically proved that 
this bias will not exist during LOOCV (see the accompanying 
supplementary materials for the proof). Furthermore, the results of 
the independent tests (see Table 3), for which no such bias occurs, 
also strongly suggest that HybridGO-Loc outperforms other 
predictors by a large margin. 
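
The situation described above can be made concrete with a small Python sketch that lists, for each LOOCV fold, the GO terms of the held-out protein that occur in none of the training proteins. In the frequency-vector space defined by W, these terms correspond to columns that are all zero in the training data of that fold, and because the weight vector of a linear SVM is a linear combination of the training vectors, its coordinates along these columns are zero. The sketch only illustrates this observation; it is not the proof given in the supplementary materials, and the function name and toy data are hypothetical.

```python
from typing import Dict, List, Sequence


def loo_unseen_terms(protein_terms: Dict[str, Sequence[str]]) -> Dict[str, List[str]]:
    """For each held-out protein, list its GO terms absent from the training fold."""
    unseen = {}
    for held_out, terms in protein_terms.items():
        seen_in_training = set()
        for name, other_terms in protein_terms.items():
            if name != held_out:
                seen_in_training.update(other_terms)
        # Columns of W for these terms are all-zero in this fold's training set.
        unseen[held_out] = sorted(set(terms) - seen_in_training)
    return unseen


if __name__ == "__main__":
    # Toy dataset: protein name -> GO terms (identifiers are placeholders).
    toy = {"P1": ["a", "b"], "P2": ["b", "c"], "P3": ["b"]}
    for protein, terms in loo_unseen_terms(toy).items():
        print(protein, terms)
```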

Conclusions 

This paper proposes a new multi-label predictor that hybridizes 
GO frequency features and semantic similarity features to predict 
the subcellular locations of multi-label proteins. Three different 
semantic similarity measures have been investigated and 
combined with GO frequency features to formulate GO hybrid 
feature vectors. The feature vectors are subsequently classified 
by multi-label multi-class support vector machine (SVM) 
classifiers equipped with an adaptive decision strategy that can produce 
multiple class labels for a query protein. Compared to existing 
multi-label subcellular-localization predictors, our proposed 
predictor has the following advantages: (1) it formulates the feature 
vectors by hybridizing GO frequency of occurrences and GO 
semantic similarity features, which contain richer information 
than GO term frequencies alone; (2) it adopts a new strategy to 
incorporate richer and more useful homologous information from 
more distant homologs rather than using the top homologs only; 
(3) it adopts an adaptive decision strategy for multi-label SVM 
classifiers so that it can effectively deal with datasets containing 
both single-label and multi-label proteins. Experimental results 
demonstrate the superiority of the proposed hybrid features over 
each of the individual features. It was also found that the proposed 
predictor performs remarkably better than existing state-of-the-art 
predictors. For readers' convenience, HybridGO-Loc is available 
online at http://bioinfo.eie.polyu.edu.hk/HybridGoServer/. 